SYS 6018 | Spring 2021 | University of Virginia
This project classifies pixels from geo-referenced aerial imagery of Haiti by what they depict. The motivating task is to identify blue tarps, which marked temporary shelters of displaced persons after the 2010 Haiti earthquake, so that relief workers could locate them from imagery rather than ground search.
Load and explore the data.
# Load Required Packages
library(tidyverse)
library(pROC)
library(randomForest)
library(GGally)
library(gridExtra)
library(plotly)
#library(reticulate)
library(regclass)
#library(ROSE)
library(MLeval)
library(ggplot2)
library(purrr)
library(broom)

url = 'HaitiPixels.csv'
#url = 'https://collab.its.virginia.edu/access/lessonbuilder/item/1707832/group/17f014a1-d43d-4c78-a5c6-698a9643404f/Module3/HaitiPixels.csv'
haiti <- read_csv(url)
print(dim(haiti))
#> [1] 63241 4
head(haiti)
#> # A tibble: 6 x 4
#> Class Red Green Blue
#> <chr> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
The dataframe contains 4 columns and 63,241 rows. The Class column contains the correct label for each observation. The Red, Green, and Blue columns give the intensity of each color channel for the pixel, on the standard 0-255 RGB scale.
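To make the RGB encoding concrete, base R's rgb() maps a triple of 0-255 channel intensities to the familiar hex color notation (a quick sketch using the first Vegetation pixel from the data above):

```r
# The first pixel in the data: Red = 64, Green = 67, Blue = 50.
# rgb() converts the three channel intensities to a hex color string.
hex <- rgb(64, 67, 50, maxColorValue = 255)
hex
#> [1] "#404332"
```

Each pair of hex digits is one channel: 0x40 = 64 (Red), 0x43 = 67 (Green), 0x32 = 50 (Blue).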
To prepare the data for exploratory data analysis, I convert Class to a factor.
haiti <- haiti %>%
  mutate(Class = factor(Class))
haiti
#> # A tibble: 63,241 x 4
#> Class Red Green Blue
#> <fct> <dbl> <dbl> <dbl>
#> 1 Vegetation 64 67 50
#> 2 Vegetation 64 67 50
#> 3 Vegetation 64 66 49
#> 4 Vegetation 75 82 53
#> 5 Vegetation 74 82 54
#> 6 Vegetation 72 76 52
#> 7 Vegetation 71 72 51
#> 8 Vegetation 69 70 49
#> 9 Vegetation 68 70 49
#> 10 Vegetation 67 70 50
#> # ... with 63,231 more rows
Examine the numbers and percentages in each of the 5 classes:
haiti %>%
  group_by(Class) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 5 x 3
#> Class N Perc
#> * <chr> <int> <dbl>
#> 1 Blue Tarp 2022 3
#> 2 Rooftop 9903 16
#> 3 Soil 20566 33
#> 4 Various Non-Tarp 4744 8
#> 5 Vegetation 26006 41
The records are not evenly distributed across the categories. Blue Tarp, our "positive" category if we frame this as a binary positive/negative identification, is only 3% of the sample, while Soil and Vegetation together make up the majority at 74%.
It will be interesting to compare performance when predicting each of the five categories versus a binary is/is-not Blue Tarp classification.
After reviewing box plots for the 2-class data set, I also created two new calculated variables:
1. GBSqr = (Green + Blue)^2 * 0.001
2. RBSqr = (Red + Blue)^2 * 0.001
I created these to keep using the Red and Green values while increasing the difference in median value between the positive and negative classes. There is significant interplay among the Red, Green, and Blue values in identifying the correct shade of blue, so I wanted to retain Red and Green but increase the linear separability between the classes. The 0.001 multiplier returns the numbers to a range similar to standard RGB values.
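A hand-check of the two formulas on the first pixel in the data (Red = 64, Green = 67, Blue = 50) shows the 0.001 scaling keeping the results near the RGB range:

```r
# Engineered features for the first pixel, per the definitions above.
GBSqr <- (67 + 50)^2 * 0.001   # 117^2 / 1000
RBSqr <- (64 + 50)^2 * 0.001   # 114^2 / 1000
c(GBSqr = GBSqr, RBSqr = RBSqr)
#>  GBSqr  RBSqr
#> 13.689 12.996
```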
haitiBinary = haiti %>%
  mutate(ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))
haitiBinarySqrs = haiti %>%
  mutate(GBSqr = I(((Green + Blue)^2) * .001), RBSqr = I(((Red + Blue)^2) * .001), ClassBinary = if_else(Class == 'Blue Tarp', '1', '0'), ClassBinary = factor(ClassBinary))

Examine the numbers and percentages in each of the 2 classes:
haitiBinary %>%
  group_by(ClassBinary) %>%
  summarize(N = n()) %>%
  mutate(Perc = round(N / sum(N), 2) * 100)
#> # A tibble: 2 x 3
#> ClassBinary N Perc
#> * <fct> <int> <dbl>
#> 1 0 61219 97
#> 2 1 2022 3
redplot <- ggplot(haiti, aes(x=Class, y=Red)) +
geom_boxplot(col='red')
greenplot <- ggplot(haiti, aes(x=Class, y=Green)) +
geom_boxplot(col='darkgreen')
blueplot <- ggplot(haiti, aes(x=Class, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplot, greenplot, blueplot)

redplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Red)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Green)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinary, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)

### How are red, blue and green values distributed between the 2 classes with the square values for Red + Blue and Green + Blue?
redplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=RBSqr)) +
geom_boxplot(col='red')
greenplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=GBSqr)) +
geom_boxplot(col='darkgreen')
blueplotB <- ggplot(haitiBinarySqrs, aes(x=ClassBinary, y=Blue)) +
geom_boxplot(col='darkblue')
grid.arrange(redplotB, greenplotB, blueplotB)

For the 5-class box plots:
With “Blue Tarp” as the “positive” result and all other classes as the “negative” result, the five-category box plots show that “Soil” and “Vegetation” have relatively distinctive RGB distributions, while “Rooftop” and “Various Non-Tarp” are more similar to each other in their RGB distributions.
For the 2-class box plots:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the ranges of red and green are much smaller for blue tarp than non-blue-tarp.
Generally, the values of red have a larger range for negative results than for positive results, and the positive results have a similar median to the negative results.
Green values have a larger range for negative results than for positive results, and the positive results have a higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
For the 2-class box plots with the additive square values:
If the classes are collapsed to binary values of “Blue Tarp (1)” and “Not Blue Tarp (0)” there is little overlap in the blue values for the two classes, and the RBSqr and GBSqr values have much less overlap than without the additive square variables.
The values of RBSqr have a larger range for negative results than for positive results, and the median is significantly greater in the positive results.
GBSqr values have a larger range for negative results than for positive results. The positive results have a significantly higher median than the negative results.
There is almost no overlap in the blue data with non-blue tarps, and blue tarps.
These correlations make sense, as the pixels are of highly saturated colors that are not pure Blue, Red, or Green. There are few pixels in the data set with low values for R, G, and B.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haiti, progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#ggpairs(haiti, lower = list(continuous = "points", combo = "dot_no_facet"), progress = F)
ggpairs(haitiBinary[-1], progress = F)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(haitiBinarySqrs[-1], progress = F)
#> Warning: Computation failed in `stat_density()`:
#> attempt to apply non-function
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Computation failed in `stat_bin()`:
#> attempt to apply non-function
RBSqr and GBSqr show significantly less variance and better differentiation between the two classes than the raw Red and Green variables, so I will use these transformed variables in my models. (The stat_density()/stat_bin() warnings above likely stem from the I() wrapper, which gives the new columns the "AsIs" class that GGally cannot plot.)
To view the relationship between the Red, Green, and Blue values between the five classes, and the binary classes, an interactive 3-D scatter plot is illustrative.
fiveCat3D = plot_ly(x=haiti$Red, y=haiti$Blue, z=haiti$Green, type="scatter3d", mode="markers", color=haiti$Class, colors = c('blue2','azure4','chocolate4','coral2','chartreuse4'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
fiveCat3D = fiveCat3D %>%
layout(title="5 Category RBG Plot", scene = list(xaxis = list(title = "Red", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "Green", color="green")))
fiveCat3D

5-Class 3-D Scatter Plot Observations
One can see that there are discernible groupings of pixel categories by RGB values. Unsurprisingly, the blue tarps are higher blue values, but they do have a range of red and green values.
The 3D scatter plot is particularly useful because, by zooming in, one can see that while the ‘Blue Tarp’ values are generally distinct, there is a space in the 3D plot with mingling of “blue tarp” pixels and other pixel categories. That area of the data will provide a challenge for our model.
binary3D = plot_ly(x=haitiBinarySqrs$RBSqr, y=haitiBinarySqrs$Blue, z=haitiBinarySqrs$GBSqr, type="scatter3d", mode="markers", color=haitiBinarySqrs$ClassBinary, colors = c('red','blue2'),
marker = list(symbol = 'circle', sizemode = 'diameter', opacity =0.35))
binary3D = binary3D %>%
layout(title="Binary RGB Plot", scene = list(xaxis = list(title = "RBSqr", color="red"), yaxis = list(title = "Blue", color="blue"), zaxis = list(title = "GBSqr", color="green")))
binary3D

2-Class 3-D Scatter Plot Observations With Blue, GBSqr, and RBSqr
Similar to the five category 3D scatter plot, the binary scatter plot shows distinct groupings for blue tarp and non-blue-tarp. There is a clear linear boundary between the blue tarp and non-blue tarp observations.
I will hold out 20% of the data set for testing/validation.
library(caret)
library(boot)

train = haitiBinarySqrs
#set.seed(1976)
#sample_size = floor(0.8*nrow(haitiBinarySqrs))
# randomly split data in r
#picked = sample(seq_len(nrow(haitiBinarySqrs)),size = sample_size)
#train = haitiBinarySqrs[picked,]
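The commented-out code above draws one random 80% sample. Because Blue Tarp is only ~3% of the data, a stratified variant that samples within each class keeps the positive rate equal in both partitions; a sketch on a small synthetic frame (the ClassBinary name is the only thing carried over from the real data, which would replace `toy` below):

```r
# Stratified 80/20 split: sample 80% of the row indices within each class.
set.seed(1976)
toy <- data.frame(ClassBinary = factor(rep(c("0", "1"), c(970, 30))))
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$ClassBinary),
                     function(i) sample(i, size = floor(0.8 * length(i)))))
train_toy <- toy[idx, , drop = FALSE]   # 776 negatives + 24 positives = 800 rows
test_toy  <- toy[-idx, , drop = FALSE]  # remaining 200 rows
```

Both partitions retain the 3% positive share, which matters when the hold-out set is used to estimate TPR on such a rare class.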
#test = haitiBinarySqrs[-picked,]

For logistic regression, LDA, QDA, and KNN, cross-validation threshold performance used ROC for tuning.
The following performance measures are collected for both the 10-fold cross-validation and the hold-out/testing/validation data:
For the models:

* No: Not a Blue Tarp (Negative)
* Yes: Blue Tarp (Positive)
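The tabulated measures all follow directly from the 2x2 confusion matrix. A quick base-R sketch, using illustrative counts (not results from any model here):

```r
# Illustrative confusion-matrix counts, 'Yes' (Blue Tarp) as positive.
TP <- 1950; FN <- 72; FP <- 150; TN <- 61069
TPR       <- TP / (TP + FN)               # sensitivity / recall
FPR       <- FP / (FP + TN)               # 1 - specificity
precision <- TP / (TP + FP)
accuracy  <- (TP + TN) / (TP + FN + FP + TN)
round(c(TPR = TPR, FPR = FPR, precision = precision, accuracy = accuracy), 4)
```

With a 3% positive class, accuracy alone is misleading (predicting all-No already scores ~97%), which is why TPR, FPR, and precision are reported alongside it.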
Per our course’s Module 3 instruction, logistic regression is typically used when there are 2 classes. I will use the train dataframe (haitiBinarySqrs) with its two-class ClassBinary response.
Reset the level names to enable the caret functions for the ROC curve.
levels(train$ClassBinary)
#> [1] "0" "1"
levels(train$ClassBinary) = c("No","Yes")
levels(train$ClassBinary)
#> [1] "No" "Yes"
fct_count(train$ClassBinary)
#> # A tibble: 2 x 2
#> f n
#> <fct> <int>
#> 1 No 61219
#> 2 Yes 2022
set.seed(1976)
# number: number of folds for cross validation
trctrl <- trainControl(method = "repeatedcv", summaryFunction=twoClassSummary, classProbs=T, savePredictions = T, number = 10, repeats = 2)
log.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "glmnet", trControl=trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
log.cv.model
#> glmnet
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> alpha lambda ROC Sens Spec
#> 0.1 1.922525e-05 0.9994333 0.9992894 0.8550980
#> 0.1 4.441283e-05 0.9994213 0.9993139 0.8536141
#> 0.1 1.025994e-04 0.9991691 0.9996080 0.8323501
#> 0.1 2.370179e-04 0.9985715 0.9999020 0.7875933
#> 0.1 5.475421e-04 0.9973944 1.0000000 0.7030300
#> 0.1 1.264893e-03 0.9943187 1.0000000 0.5677669
#> 0.1 2.922068e-03 0.9864913 1.0000000 0.4448581
#> 0.1 6.750356e-03 0.9711792 1.0000000 0.1026179
#> 0.1 1.559420e-02 0.9355043 1.0000000 0.0000000
#> 0.1 3.602462e-02 0.8838932 1.0000000 0.0000000
#> 0.2 1.922525e-05 0.9995102 0.9988892 0.8706762
#> 0.2 4.441283e-05 0.9994448 0.9992649 0.8575721
#> 0.2 1.025994e-04 0.9992065 0.9995508 0.8348193
#> 0.2 2.370179e-04 0.9986487 0.9999020 0.7925377
#> 0.2 5.475421e-04 0.9975380 1.0000000 0.7141552
#> 0.2 1.264893e-03 0.9947265 1.0000000 0.5816161
#> 0.2 2.922068e-03 0.9872496 1.0000000 0.4500537
#> 0.2 6.750356e-03 0.9717887 1.0000000 0.1184510
#> 0.2 1.559420e-02 0.9337115 1.0000000 0.0000000
#> 0.2 3.602462e-02 0.8745199 1.0000000 0.0000000
#> 0.3 1.922525e-05 0.9995214 0.9988157 0.8741379
#> 0.3 4.441283e-05 0.9994656 0.9991588 0.8593023
#> 0.3 1.025994e-04 0.9992423 0.9994773 0.8372933
#> 0.3 2.370179e-04 0.9987313 0.9998857 0.7992160
#> 0.3 5.475421e-04 0.9976990 1.0000000 0.7228076
#> 0.3 1.264893e-03 0.9951103 1.0000000 0.5999122
#> 0.3 2.922068e-03 0.9880966 1.0000000 0.4569807
#> 0.3 6.750356e-03 0.9725693 1.0000000 0.1377457
#> 0.3 1.559420e-02 0.9337141 1.0000000 0.0000000
#> 0.3 3.602462e-02 0.8592215 1.0000000 0.0000000
#> 0.4 1.922525e-05 0.9995254 0.9987504 0.8775996
#> 0.4 4.441283e-05 0.9994775 0.9991424 0.8617714
#> 0.4 1.025994e-04 0.9992780 0.9994528 0.8422365
#> 0.4 2.370179e-04 0.9988200 0.9997876 0.8088572
#> 0.4 5.475421e-04 0.9978658 0.9999837 0.7364105
#> 0.4 1.264893e-03 0.9954638 1.0000000 0.6157367
#> 0.4 2.922068e-03 0.9888309 1.0000000 0.4698422
#> 0.4 6.750356e-03 0.9732470 1.0000000 0.1592560
#> 0.4 1.559420e-02 0.9349837 1.0000000 0.0000000
#> 0.4 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 0.5 1.922525e-05 0.9995315 0.9986932 0.8800712
#> 0.5 4.441283e-05 0.9994884 0.9990362 0.8635041
#> 0.5 1.025994e-04 0.9993280 0.9993874 0.8456994
#> 0.5 2.370179e-04 0.9989185 0.9997305 0.8135578
#> 0.5 5.475421e-04 0.9980286 0.9999837 0.7514900
#> 0.5 1.264893e-03 0.9958251 1.0000000 0.6375006
#> 0.5 2.922068e-03 0.9894547 1.0000000 0.4881371
#> 0.5 6.750356e-03 0.9732489 1.0000000 0.1955982
#> 0.5 1.559420e-02 0.9328537 1.0000000 0.0000000
#> 0.5 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 0.6 1.922525e-05 0.9995352 0.9986524 0.8840267
#> 0.6 4.441283e-05 0.9994988 0.9989627 0.8674609
#> 0.6 1.025994e-04 0.9993739 0.9993058 0.8503987
#> 0.6 2.370179e-04 0.9990196 0.9996406 0.8219639
#> 0.6 5.475421e-04 0.9982071 0.9999673 0.7655928
#> 0.6 1.264893e-03 0.9962020 1.0000000 0.6595120
#> 0.6 2.922068e-03 0.9901824 1.0000000 0.5014925
#> 0.6 6.750356e-03 0.9741437 1.0000000 0.2477759
#> 0.6 1.559420e-02 0.9306178 1.0000000 0.0000000
#> 0.6 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 0.7 1.922525e-05 0.9995399 0.9986034 0.8889723
#> 0.7 4.441283e-05 0.9995118 0.9988811 0.8726552
#> 0.7 1.025994e-04 0.9994190 0.9992649 0.8543555
#> 0.7 2.370179e-04 0.9991234 0.9996161 0.8313625
#> 0.7 5.475421e-04 0.9983812 0.9999673 0.7833927
#> 0.7 1.264893e-03 0.9965943 1.0000000 0.6844876
#> 0.7 2.922068e-03 0.9909783 1.0000000 0.5232527
#> 0.7 6.750356e-03 0.9758676 1.0000000 0.3014400
#> 0.7 1.559420e-02 0.9282502 1.0000000 0.0000000
#> 0.7 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 0.8 1.922525e-05 0.9995437 0.9985380 0.8931766
#> 0.8 4.441283e-05 0.9995260 0.9987422 0.8778471
#> 0.8 1.025994e-04 0.9994691 0.9991179 0.8617727
#> 0.8 2.370179e-04 0.9992252 0.9994528 0.8405075
#> 0.8 5.475421e-04 0.9985716 0.9998693 0.7979771
#> 0.8 1.264893e-03 0.9969599 0.9999918 0.7129213
#> 0.8 2.922068e-03 0.9921717 1.0000000 0.5462493
#> 0.8 6.750356e-03 0.9779321 1.0000000 0.3669743
#> 0.8 1.559420e-02 0.9252658 1.0000000 0.0000000
#> 0.8 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 0.9 1.922525e-05 0.9995464 0.9984809 0.8993598
#> 0.9 4.441283e-05 0.9995373 0.9986279 0.8874884
#> 0.9 1.025994e-04 0.9995039 0.9989056 0.8706762
#> 0.9 2.370179e-04 0.9993457 0.9992404 0.8513852
#> 0.9 5.475421e-04 0.9988153 0.9996488 0.8189997
#> 0.9 1.264893e-03 0.9975457 0.9999837 0.7435790
#> 0.9 2.922068e-03 0.9935624 1.0000000 0.5850729
#> 0.9 6.750356e-03 0.9805630 1.0000000 0.4243379
#> 0.9 1.559420e-02 0.9217851 1.0000000 0.0000000
#> 0.9 3.602462e-02 0.8502785 1.0000000 0.0000000
#> 1.0 1.922525e-05 0.9995488 0.9984074 0.9025752
#> 1.0 4.441283e-05 0.9995466 0.9984727 0.8993598
#> 1.0 1.025994e-04 0.9995347 0.9986360 0.8869934
#> 1.0 2.370179e-04 0.9994747 0.9989872 0.8686973
#> 1.0 5.475421e-04 0.9990646 0.9992731 0.8412501
#> 1.0 1.264893e-03 0.9982243 0.9998857 0.7888309
#> 1.0 2.922068e-03 0.9951431 1.0000000 0.6357618
#> 1.0 6.750356e-03 0.9841582 1.0000000 0.4614276
#> 1.0 1.559420e-02 0.9167382 1.0000000 0.0000000
#> 1.0 3.602462e-02 0.8502785 1.0000000 0.0000000
#>
#> ROC was used to select the optimal model using the largest value.
#> The final values used for the model were alpha = 1 and lambda = 1.922525e-05.
10-fold cross-validation selected alpha = 1 and lambda = 1.92e-05 for the glmnet model when ROC was used as the performance metric.
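For context on what those two tuning parameters control: glmnet's elastic-net penalty is lambda * ((1 - alpha)/2 * ||b||₂² + alpha * ||b||₁), so the selected alpha = 1 reduces it to a pure lasso term, and the tiny lambda means almost no shrinkage. A sketch with made-up coefficients:

```r
# Elastic-net penalty as used by glmnet (coefficients here are illustrative).
enet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
b <- c(0.8, -0.3, 0.05)
enet_penalty(b, alpha = 1,   lambda = 1.9e-05)  # lasso: lambda * sum(|b|)
enet_penalty(b, alpha = 0.5, lambda = 1.9e-05)  # mixed ridge/lasso
```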
caret::confusionMatrix(log.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.6 0.3
#> Yes 0.2 2.9
#>
#> Accuracy (average) : 0.9953
result = evalm(log.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.986587402018881
#> Group 1 AUC-ROC = 1
result$roc

The Logistic Regression ROC-AUC for the 10-fold cross-validated training data is 1.0.
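As a sanity check on what MLeval reports, ROC-AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, and the Mann-Whitney rank form computes it in a few lines of base R (sketched on made-up scores, not the model's predictions):

```r
# AUC via the rank-sum (Mann-Whitney) identity; label 1 = positive class.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1); n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_rank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))
#> [1] 0.75
```

An AUC of 1.0 means every positive scored above every negative in the resampled predictions, consistent with the near-perfect linear separability seen in the 3-D scatter plot.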
Train the LDA model using 10-fold cross validation. Tuning performed using ROC.
set.seed(1976)
lda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "lda", trControl=trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
lda.cv.model
#> Linear Discriminant Analysis
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results:
#>
#> ROC Sens Spec
#> 0.994494 0.9992241 0.8466895
caret::confusionMatrix(lda.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.7 0.5
#> Yes 0.1 2.7
#>
#> Accuracy (average) : 0.9943
result.lda = evalm(lda.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.908116090617833
#> Group 1 AUC-ROC = 0.99
result.lda$roc

The LDA ROC-AUC for the 10-fold cross-validated training data is 0.99.
Train the QDA model using 10-fold cross validation. Tuning performed using ROC.
set.seed(1976)
qda.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "qda", trControl=trctrl, tuneLength = 10)
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
qda.cv.model
#> Quadratic Discriminant Analysis
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results:
#>
#> ROC Sens Spec
#> 0.9973852 0.9977785 0.8949105
caret::confusionMatrix(qda.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.6 0.3
#> Yes 0.2 2.9
#>
#> Accuracy (average) : 0.9945
result.qda = evalm(qda.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.945397391140487
#> Group 1 AUC-ROC = 1
result.qda$roc

The QDA ROC-AUC for the 10-fold cross-validated training data is 1.0.
Train the KNN model using 10-fold cross validation. Tuning performed using ROC.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 1:21))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k ROC Sens Spec
#> 1 0.9761098 0.9984809 0.9428852
#> 2 0.9898714 0.9983093 0.9453543
#> 3 0.9945459 0.9984972 0.9537555
#> 4 0.9968216 0.9984809 0.9540092
#> 5 0.9973851 0.9984400 0.9567332
#> 6 0.9980277 0.9983339 0.9601948
#> 7 0.9984119 0.9983175 0.9621738
#> 8 0.9986689 0.9983502 0.9589584
#> 9 0.9989239 0.9983094 0.9579757
#> 10 0.9991942 0.9983502 0.9557492
#> 11 0.9992105 0.9983829 0.9569819
#> 12 0.9994715 0.9983910 0.9564881
#> 13 0.9994774 0.9983829 0.9562381
#> 14 0.9994872 0.9983910 0.9554980
#> 15 0.9994948 0.9984482 0.9545091
#> 16 0.9994968 0.9983992 0.9552456
#> 17 0.9995010 0.9984155 0.9547554
#> 18 0.9995001 0.9983829 0.9537641
#> 19 0.9994956 0.9984155 0.9540104
#> 20 0.9994977 0.9983910 0.9537665
#> 21 0.9994940 0.9983747 0.9532690
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 17.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 17:35))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k ROC Sens Spec
#> 17 0.9995010 0.9984155 0.9550029
#> 18 0.9995001 0.9983992 0.9535166
#> 19 0.9994956 0.9984155 0.9540104
#> 20 0.9994977 0.9984237 0.9527728
#> 21 0.9994940 0.9983747 0.9535166
#> 22 0.9994928 0.9983992 0.9532703
#> 23 0.9994911 0.9983339 0.9545054
#> 24 0.9994866 0.9983665 0.9542604
#> 25 0.9994843 0.9983093 0.9540116
#> 26 0.9996071 0.9983420 0.9537653
#> 27 0.9997275 0.9983502 0.9545067
#> 28 0.9997262 0.9983665 0.9542616
#> 29 0.9997247 0.9983420 0.9535178
#> 30 0.9997221 0.9983420 0.9542604
#> 31 0.9997199 0.9983420 0.9542604
#> 32 0.9997187 0.9983093 0.9530227
#> 33 0.9997147 0.9983012 0.9532727
#> 34 0.9997127 0.9983093 0.9532727
#> 35 0.9997103 0.9982685 0.9535202
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 27.
set.seed(1976)
knn.cv.model = train(ClassBinary ~ Blue+Green+Red+GBSqr+RBSqr, data = train, method = "knn", trControl=trctrl, tuneGrid = expand.grid(k = 27:51))
#> Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
#> in the result set. ROC will be used instead.
knn.cv.model
#> k-Nearest Neighbors
#>
#> 63241 samples
#> 5 predictor
#> 2 classes: 'No', 'Yes'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 2 times)
#> Summary of sample sizes: 56916, 56917, 56917, 56917, 56917, 56917, ...
#> Resampling results across tuning parameters:
#>
#> k ROC Sens Spec
#> 27 0.9997275 0.9983502 0.9545067
#> 28 0.9997262 0.9983502 0.9537641
#> 29 0.9997247 0.9983420 0.9535178
#> 30 0.9997221 0.9983420 0.9535190
#> 31 0.9997199 0.9983420 0.9542604
#> 32 0.9997187 0.9983093 0.9525277
#> 33 0.9997147 0.9983012 0.9532727
#> 34 0.9997127 0.9983175 0.9525326
#> 35 0.9997103 0.9982685 0.9532727
#> 36 0.9997113 0.9982440 0.9532763
#> 37 0.9997088 0.9982685 0.9525326
#> 38 0.9997065 0.9982440 0.9508011
#> 39 0.9997040 0.9982930 0.9508011
#> 40 0.9997064 0.9983012 0.9495647
#> 41 0.9997078 0.9983094 0.9498098
#> 42 0.9997053 0.9983257 0.9490684
#> 43 0.9997041 0.9982930 0.9480808
#> 44 0.9997015 0.9982848 0.9478332
#> 45 0.9997018 0.9983012 0.9475857
#> 46 0.9997006 0.9982848 0.9463530
#> 47 0.9997002 0.9983094 0.9473382
#> 48 0.9996986 0.9982930 0.9446191
#> 49 0.9996975 0.9983012 0.9461018
#> 50 0.9996963 0.9982930 0.9448678
#> 51 0.9996945 0.9982930 0.9453617
#>
#> ROC was used to select the optimal model using the largest value.
#> The final value used for the model was k = 27.
caret::confusionMatrix(knn.cv.model)
#> Cross-Validated (10 fold, repeated 2 times) Confusion Matrix
#>
#> (entries are percentual average cell counts across resamples)
#>
#> Reference
#> Prediction No Yes
#> No 96.6 0.1
#> Yes 0.2 3.1
#>
#> Accuracy (average) : 0.9969
result.knn = evalm(knn.cv.model)
#> ***MLeval: Machine Learning Model Evaluation***
#> Input: caret train function object
#> Averaging probs.
#> Group 1 type: repeatedcv
#> Observations: 63241
#> Number of groups: 1
#> Observations per group: 63241
#> Positive: Yes
#> Negative: No
#> Group: Group 1
#> Positive: 2022
#> Negative: 61219
#> ***Performance Metrics***
#> Group 1 Optimal Informedness = 0.989088354922491
#> Group 1 AUC-ROC = 1
result.knn$roc
Across k = 1-51, the best k is 27. Reviewing the tables of ROC, sensitivity, and specificity for each cross-validation run shows that the improvements within this range amount to hundredths of a percentage point of ROC, so any k in roughly the 10-51 range is a reasonable selection for the cross-validated training data.
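To make the role of k concrete, a minimal majority-vote KNN in base R: each query point takes the most common label among its k nearest training points by Euclidean distance (toy data below, not the Haiti pixels):

```r
# Minimal KNN classifier: majority vote over the k nearest training rows.
knn_predict <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums((train_x - matrix(query, nrow(train_x), length(query),
                                      byrow = TRUE))^2))
  votes <- train_y[order(d)[seq_len(k)]]
  names(which.max(table(votes)))
}
x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
y <- c("No", "No", "No", "Yes", "Yes", "Yes")
knn_predict(x, y, query = c(4.5, 5), k = 3)
#> [1] "Yes"
```

Larger k smooths the decision boundary, which is why the specificity table degrades slowly as k grows past the optimum: rare Blue Tarp points near the class boundary get outvoted by their more numerous negative neighbors.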
NOTE: PART II same as above plus add Random Forest and SVM to Model Training.
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Log Reg | N/A | 1.0 | 1.0 | 0.9953 | 0.998 | 0.094 | 0.997 |
| LDA | N/A | 0.99 | | 0.9943 | 0.999 | 0.156 | 0.995 |
| QDA | N/A | 1.0 | | 0.9945 | 0.998 | 0.094 | 0.997 |
| KNN | k = 27 | 1.0 | | 0.997 | 0.998 | 0.0313 | 0.999 |
| Penalized Log Reg | | | | | | | |
| Random Forest | | | | | | | |
| SVM | | | | | | | |
The same performance measures will be collected on the hold-out/testing data:
| Model | Tuning | AUROC | Threshold | Accuracy | TPR | FPR | Precision |
|---|---|---|---|---|---|---|---|
| Log Reg | N/A | TBD | TBD | TBD | TBD | TBD | TBD |
| LDA | N/A | TBD | TBD | TBD | TBD | TBD | TBD |
| QDA | N/A | TBD | TBD | TBD | TBD | TBD | TBD |
| KNN | k = TBD | TBD | TBD | TBD | TBD | TBD | TBD |
| Penalized Log Reg | TBD | TBD | TBD | TBD | TBD | TBD | TBD |
| Random Forest | TBD | TBD | TBD | TBD | TBD | TBD | TBD |
| SVM | TBD | TBD | TBD | TBD | TBD | TBD | TBD |